Titanic Analysis and Prediction

Table of Contents:

1. Light Analysis

2. Visualizations

3. Feature Engineering

4. Cleaning Data

5. Modeling

6. Refining Model

7. Creating Submission File

1. Light Analysis

This will be a high-level observation of the data to understand its general shape and correlations. We can also start developing some ideas for potential feature engineering.

Observations and Prediction

General:

Numeric and categorical features:

Data types:

Missing data:
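A minimal sketch of these checks with pandas. The rows below are hypothetical stand-ins for the Kaggle train.csv (real column names, made-up values), since the point here is the inspection pattern, not the data itself:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle train.csv (hypothetical rows, real schema).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", "S", np.nan],
})

train.info()  # data types and non-null counts per column

# Split columns into numeric and categorical by dtype.
numeric = train.select_dtypes(include="number").columns.tolist()
categorical = train.select_dtypes(exclude="number").columns.tolist()

# Missing values per column.
missing = train.isnull().sum()

print("numeric:", numeric)
print("categorical:", categorical)
print(missing[missing > 0])
```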

Prediction
Now that we know what types of values we are working with, let's make a few predictions:

Let's create a few plots to see if we are close. We will plot the counts for both the numeric and categorical columns, as well as the number of survivors.

2. Visualizations

I will plot the distributions/counts of each column, followed by the survivors.
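The numbers behind these count plots can be tabulated directly; a count plot (e.g. seaborn's countplot) draws exactly these tallies. Toy data again, standing in for the full train set:

```python
import pandas as pd

# Toy subset of the train columns (hypothetical values).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 2, 1, 3],
})

# Passengers per category, and survivors vs. deaths per category.
counts = train["Sex"].value_counts()
survivors = pd.crosstab(train["Sex"], train["Survived"])

print(counts)
print(survivors)
```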

General Observations: For the most part, the number of survivors appears to depend heavily on the sheer volume of people in a given category. Raw survival counts alone are not a fair comparison, since volume inflates them, so we will look at percentages later on.

Noteworthy Categories: Sex and class are the only two categories with a strong correlation to survival rate. Babies also appear to have a fairly high survival rate. We will take a more in-depth look in the next section.

3. Feature Engineering

  1. Sex: Is sex the largest determining factor?
  2. Pclass: How strong is the correlation between class and survival?
  3. Age: Which ages are the most likely to survive?
  4. SibSp and Parch: Does travelling alone or with a group impact survival rate?

We will validate our original predictions.

1. Sex

The survival rate discrepancies between genders are massive. It is highly likely this is a key determining factor for a passenger's survival.
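Because Survived is a 0/1 column, the survival rate per sex is just a grouped mean. A sketch on toy rows (hypothetical values; the notebook runs this on the full train set):

```python
import pandas as pd

# Toy data (hypothetical); Survived is 0/1, so its mean is a rate.
train = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Survival rate per sex.
rate = train.groupby("Sex")["Survived"].mean()
print(rate)
```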

2. Pclass

The correlation between class and survival is quite strong and monotonic: survival falls steadily from first to third class. I believe class and sex are the two main factors determining survival rate.

3. Age

We can see that survival is most concentrated among people between 15 and 30. However, this is also the range where most people died, so it does not tell us much: both numbers track the volume of people in each age category. To account for this, we need to sort everyone into age bands and compute the survival percentage of each.

Other observations: every death cell is at least a shade darker than the corresponding survival cell except for ages 0-10. That means more people died than survived in every age band except for little kids. Let's look into this.

There does not seem to be much of a correlation between age and survival rate; it sits around 50% for each age range. However, the survival rate for children is very high. The rate for ages 73-80 is also high, but our initial age plot shows very few data points in that range, which inflates the value. The error bar makes it clear that the 73-80 rate is not reliable, so we will drop it later when cleaning the data.
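This binning can be sketched with pd.cut. The ages and bin edges below are illustrative, not the notebook's exact bands:

```python
import pandas as pd

# Toy ages and outcomes (hypothetical rows).
train = pd.DataFrame({
    "Age": [4, 9, 22, 25, 40, 55, 70, 75],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# pd.cut assigns each passenger to an interval; groupby then averages
# the 0/1 Survived column, giving a survival % per age band.
bins = [0, 10, 30, 60, 80]
train["AgeBand"] = pd.cut(train["Age"], bins=bins)
survival_pct = train.groupby("AgeBand", observed=True)["Survived"].mean() * 100
print(survival_pct)
```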

4. SibSp & Parch

These results are interesting and unexpected. Travelling with 1 or 2 others appears to increase the survival rate, perhaps because a small group of 2-3 people could work together and help each other; the small error bars support this. Travelling with more than that, however, becomes a problem: the survival rate drops from 46% to 25%.
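A common way to look at SibSp and Parch together is a combined family-size feature; a sketch on toy rows (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical). SibSp + Parch counts the companions
# travelling with the passenger, so 0 means travelling alone.
train = pd.DataFrame({
    "SibSp": [0, 1, 1, 2, 4, 0],
    "Parch": [0, 0, 1, 1, 2, 0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

train["FamilySize"] = train["SibSp"] + train["Parch"]
rate = train.groupby("FamilySize")["Survived"].mean()
print(rate)
```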

4. Cleaning Data

In this section we will address missing values and prepare our training data for modeling.

Across the train and test data there are 1309 entries. 263 (20%) Age values are missing and 2 (0.15%) Embarked values are missing.

Age: Age does not seem to correlate strongly with survival rate, so for now we can simply fill the missing values with the median age.

#TODO: Predict age based on Title, SibSp, Parch

Embarked: fill with "S" (Southampton), the most common value.
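Both fills can be sketched in a couple of lines; the toy frame below reproduces the two kinds of gaps described above:

```python
import numpy as np
import pandas as pd

# Toy frame with missing Age and Embarked values (hypothetical rows).
train = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Fill Age with the median and Embarked with the most common port.
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna("S")

assert train.isnull().sum().sum() == 0  # no gaps remain
print(train)
```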

Nice, all our values are filled. We can finally start modeling.

5. Modeling

We will test an assortment of different machine learning models from simplest (at least in my mind) to most complex.

  1. Decision Tree
  2. Random Forest
  3. Naive Bayes
  4. Logistic Regression
  5. KNN (K Nearest Neighbor)
  6. Support Vector Machines
  7. Perceptron
  8. Stochastic Gradient Descent
  9. XGBoost (eXtreme Gradient Boosting)
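The comparison can be sketched with scikit-learn cross-validation. The synthetic data below stands in for the engineered Titanic features, and only a few of the models are shown:

```python
# Sketch of the model-comparison loop on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```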

Of all the models, the Gradient Boosting Classifier has the highest score. Of course, that does not mean it is the best model, as it may simply suit our particular data split; however, we will select it for refinement.

6. Refining Model

From the documentation, the two most important parameters for the Gradient Boosting Classifier are the maximum depth and the number of estimators. Let's test a few cases using values close to the defaults.
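This tuning step can be sketched as a grid search over max_depth and n_estimators near their defaults, again on synthetic stand-in data:

```python
# Sketch of the tuning step with GridSearchCV on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Values bracketing the defaults (max_depth=3, n_estimators=100).
param_grid = {"max_depth": [2, 3, 4], "n_estimators": [50, 100, 200]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```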

As predicted, depth has a slightly stronger effect on the model than the number of estimators. From all three charts, we can see that the ideal values are a depth of 3 and 100 estimators. With this in mind, let's build our final model and make a prediction!

7. Creating Submission File
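Kaggle expects a two-column CSV of PassengerId and Survived. A sketch with placeholder ids and predictions (the real notebook would use the test set's PassengerId column and the fitted model's output):

```python
import pandas as pd

# Placeholder ids and predictions standing in for the real test set
# and the final model's predict() output.
test_ids = [892, 893, 894]
preds = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": test_ids, "Survived": preds})
submission.to_csv("submission.csv", index=False)  # file Kaggle accepts
print(submission)
```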